feat: spec-compliant hashing + 36 golden parity tests#2

Merged
eywalker merged 4 commits into main from
eywalker/plt-1043-prepare-nauticalabstarfix-python-to-reach-logical-hashing
Mar 20, 2026

Conversation


@agent-kurouto agent-kurouto bot commented Mar 20, 2026

Summary

  • Rewrites arrow_digester.py to implement the byte-layout spec's BTreeMap decomposition for hash_array and the record-batch path:
    • Lists decompose into separate validity-only, structural, and data entries (spec Section 3.4)
    • hash_array now uses the same recursive decomposition as hash_record_batch (spec Section 6)
    • Struct null AND-propagation follows spec Section 3.5
  • Adds tests/test_golden_parity.py with 45 tests (36 unique golden hashes) verified against the Rust starfix crate, covering:
    • Spec examples A, B, C, D, E, F, G, H, I, J, K, N, O
    • All primitive types (bool, int8–uint64, float32, float64)
    • Temporal types (date32, date64, time32 s/ms, time64 us/ns)
    • Variable-length (binary, string) with type canonicalization
    • Decimal128 at three precision ranges (1–9, 10–18, 19–38)
    • List, List/LargeList equivalence
    • Struct column, nested struct (two levels), struct with nullable children
    • Multi-type record batch (24 columns), schema-only hash
    • Streaming two-batch incremental update, batch-split independence
  • Fixes list hash_array bug: Previously used a single combined entry instead of the spec's decomposed entries, producing incorrect hashes for any list-typed array
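
The three-entry list decomposition described above can be sketched roughly as follows. The entry paths, `#` suffixes, and byte widths here are illustrative assumptions, not the actual arrow_digester.py internals:

```python
# Hypothetical sketch of the spec 3.4 list decomposition: one validity-only
# entry, one structural (offsets) entry, and one data entry for the
# flattened child values. Names and layouts are illustrative, not the real API.
def decompose_list(path, validity, offsets, child_values):
    entries = {}
    if validity is not None:
        entries[f"{path}#validity"] = bytes(validity)
    entries[f"{path}#offsets"] = b"".join(
        o.to_bytes(4, "little") for o in offsets
    )
    entries[f"{path}#data"] = b"".join(
        v.to_bytes(4, "little", signed=True) for v in child_values
    )
    return entries  # iterated in sorted key order, analogous to a BTreeMap

# [[10, 20], [30]] expressed as offsets plus flat child values:
entries = decompose_list("col", [1, 1], [0, 2, 3], [10, 20, 30])
print(sorted(entries))  # ['col#data', 'col#offsets', 'col#validity']
```

The old implementation folded these into a single combined entry, which is why every list-typed hash changed.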

Test plan

  • All 40 existing tests pass (0 regressions)
  • All 45 new golden parity tests pass
  • Golden values generated independently by cargo run --bin emit_golden in Rust starfix (post-PR #11)

Resolves PLT-1043

🤖 Generated with Claude Code

…den parity tests

Rewrites the core hashing algorithm to match the byte-layout spec:
- hash_array now uses the same recursive BTreeMap decomposition as the
  record-batch path (spec Section 6), fixing list and struct hashing
- Lists decompose into separate validity-only, structural, and data
  entries (spec Section 3.4) instead of a single combined entry
- Recursive traversal properly AND-propagates struct nulls to descendants
  (spec Section 3.5)
- Adds 45 golden-hash parity tests covering all primitive types, temporal
  types, variable-length types, decimals, lists, structs, nested structs,
  multi-type batches, and streaming — all verified against Rust starfix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR updates the Python ArrowDigester implementation to follow the byte-layout spec’s recursive BTreeMap-style decomposition for hashing Arrow arrays and record batches, and adds a comprehensive golden-parity test suite to ensure byte-for-byte hash equality with the Rust implementation.

Changes:

  • Reworked ArrowDigester hashing to use a recursive traversal + decomposed digest-entry model for lists/structs (shared between hash_array and record-batch hashing).
  • Added extensive golden parity tests covering many Arrow types and spec examples, with expected hashes generated from the Rust crate.
  • Updated the existing list-array golden hash to reflect the corrected spec-compliant behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File: Description
src/starfix/arrow_digester.py: Major rewrite of hashing logic and decomposition/traversal implementation to match the spec.
tests/test_golden_parity.py: New golden parity regression suite (Rust-authoritative expected hashes).
tests/test_arrow_digester.py: Updates an existing list hash expectation to the new correct value.
.python-version: Pins local Python version to 3.12.

"""Pure-Python implementation of the starfix Arrow logical hasher.

Produces identical hashes to the Rust implementation for all supported types.
Implements the byte-layout specification from ``docs/byte-layout-spec.md``.

Copilot AI Mar 20, 2026

The module docstring references docs/byte-layout-spec.md, but there is no docs/ directory (and no file with that name) in this repository. This makes the reference non-actionable for maintainers; please either add the spec file at that path or update the reference to the correct location (e.g., a README section or an in-repo spec file).

Suggested change
Implements the byte-layout specification from ``docs/byte-layout-spec.md``.
Implements the Arrow byte-layout specification used by starfix.

Contributor Author

Good catch — the spec lives in the Rust repo, not here. Fixed the docstring to reference nauticalab/starfix docs/byte-layout-spec.md instead of a non-existent local path.

children.sort(key=lambda f: f.name)
for child in children:
child_path = f"{path}{DELIMITER}{child.name}" if path else child.name
_extract_type_entries(child.type, child.nullable, child_path, out)

Copilot AI Mar 20, 2026

Struct null propagation seems incomplete at the type-decomposition stage: when a struct field is nullable, all descendant entries become effectively nullable (struct validity AND child validity per spec 3.5). Here _extract_type_entries() recurses into children using only child.nullable, so non-nullable children of a nullable struct will not get a null_bits buffer and their propagated null positions won’t be represented in the final digest (only skipped in data hashing). Consider recursing with child.nullable or nullable so child entries include null_bits whenever any ancestor struct is nullable.

Suggested change
_extract_type_entries(child.type, child.nullable, child_path, out)
_extract_type_entries(child.type, child.nullable or nullable, child_path, out)

Contributor Author

This is intentional and matches both the spec and the Rust implementation.

The spec (Section 3.5, Example M) says struct-level nulls are AND-propagated at traversal time, not at entry-creation time. Non-nullable children of a nullable struct get data-only entries (no null_bits). The null propagation is handled by _traverse_struct which passes combined ancestor nulls down — _traverse_leaf then skips null rows when hashing data bytes.

From spec Example M:

Entry "a" (Int32, non-nullable → data-only): Struct nulls propagated: rows 0, 2 valid → data: [10, 30]

The Rust extract_type_entries also passes child_field.is_nullable() (not child.nullable || parent.nullable) — see arrow_digester_core.rs:934. All 126 parity tests (including nullable struct cases like r2_nullable_struct_batch and r2_nested_nullable_inner_struct) confirm this produces correct hashes.
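
The traversal-time AND-propagation this reply describes can be sketched minimally as follows; the function name is hypothetical, and the real logic lives in `_traverse_struct`/`_traverse_leaf`:

```python
# Hypothetical sketch of spec 3.5 traversal-time null propagation: a child
# row is effectively valid only where both the struct slot and the child
# slot are valid. Non-nullable children carry no validity bitmap of their own.
def combine_validity(parent_valid, child_valid):
    if parent_valid is None:
        return child_valid
    if child_valid is None:
        return parent_valid
    return [p and c for p, c in zip(parent_valid, child_valid)]

# Example M-style data: struct rows 0 and 2 valid, row 1 null; child "a"
# (Int32) is declared non-nullable, so it contributes no validity of its own.
struct_valid = [True, False, True]
values = [10, 20, 30]

effective = combine_validity(struct_valid, None)
data = [v for v, ok in zip(values, effective) if ok]
print(data)  # [10, 30], matching the spec's Example M data entry
```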

Comment on lines +314 to +315
inner = _normalize_field(dt.key_field)
return pa.map_(inner.type, dt.item_field.type)

Copilot AI Mar 20, 2026

_normalize_data_type()’s Map handling drops important information and may produce an incorrect canonical type: it only normalizes the key field (not the item/value field), it doesn’t preserve keys_sorted, and it reconstructs the map using pa.map_() defaults. If maps are intended to be supported, this should normalize both key/item fields and carry over keys_sorted to avoid schema/type hashes diverging from Arrow/Rust for map schemas.

Suggested change
inner = _normalize_field(dt.key_field)
return pa.map_(inner.type, dt.item_field.type)
key_field = _normalize_field(dt.key_field)
item_field = _normalize_field(dt.item_field)
return pa.map_(key_field.type, item_field.type, keys_sorted=dt.keys_sorted)

Contributor Author

Good catch — the item field was not being normalized. Fixed to normalize both key and item fields, and preserve keys_sorted. Map types are not yet exercised by the test suite, but this is the correct normalization per the spec.

…sorting

Adds test_golden_parity_r2.py with 41 additional golden-hash tests covering:
- Empty and all-null arrays
- Multi-byte bitvec (17-element boolean)
- Single-element edge cases
- Nullable struct via hash_array (struct-level nulls)
- Three-level nested structs (batch, schema, hash_array)
- Struct with list children
- List<Struct> and nullable List<Struct> via hash_array
- List<List<Int32>> nested lists (hash_array, batch, schema)
- Mixed struct+list+scalar record batches
- Unicode strings and varied-size binary
- Float64 special values (NaN, ±inf, ±0.0)
- Multi-batch streaming with struct and list columns
- Complex schema-only hashes (12 fields, 4-level nesting)
- Realistic geospatial-style record batch

Also fixes a bug where hash_array did not sort JSON keys in the type
metadata for struct types, causing struct hash_array to diverge from
Rust. The spec requires all JSON keys sorted alphabetically at every
nesting level.
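
Python's json module handles this sorting requirement directly; a hedged illustration (the metadata key names below are made up, not the spec's actual schema):

```python
import json

# Illustrative nested type metadata: sort_keys=True sorts keys at every
# nesting level, including dicts inside lists, which is what the spec's
# alphabetical-ordering rule requires.
type_meta = {"name": "struct", "children": [{"type": "int32", "name": "a"}]}
canonical = json.dumps(type_meta, sort_keys=True, separators=(",", ":"))
print(canonical)
# {"children":[{"name":"a","type":"int32"}],"name":"struct"}
```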

NOTE: One Rust bug discovered and filed as PLT-1047 — Rust panics on
hash_array with List<nullable leaf> when inner elements contain nulls
(sliced array null_count mismatch in traverse_leaf).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


Comment on lines +248 to +256
if pa.types.is_struct(canonical):
# Struct is transparent — no entry for the struct itself.
# Recurse into children sorted alphabetically (spec Section 3.5).
children = [canonical.field(i) for i in range(canonical.num_fields)]
children.sort(key=lambda f: f.name)
for child in children:
child_path = f"{path}{DELIMITER}{child.name}" if path else child.name
_extract_type_entries(child.type, child.nullable, child_path, out)


Copilot AI Mar 20, 2026

_extract_type_entries() treats structs as “transparent” and only uses each child field’s declared nullable to decide whether to allocate null_bits. If the parent struct itself is nullable, nulls must still be AND-propagated to descendants (spec 3.5), which requires descendant entries to include validity (null_bits) even when the child field is declared non-nullable. As written, nullable structs with non-nullable children will lose null position information (children have null_bits=None). Consider propagating nullable into the recursive calls (e.g., pass nullable or child.nullable) so descendants can record effective validity.

Contributor Author

Same point as the earlier review — addressed in the reply there. This is intentional per the spec: struct null propagation happens at traversal time (via _traverse_struct passing combined ancestor nulls), not at entry-creation time. Non-nullable children of a nullable struct get data-only entries. See spec Example M and Rust extract_type_entries at line 934. Verified by 126 parity tests.

from decimal import Decimal

import pyarrow as pa
import pytest

Copilot AI Mar 20, 2026

pytest is imported but not used in this test module. Removing unused imports helps keep the test code clean and avoids lint warnings in environments that enforce them.

Suggested change
import pytest

Contributor Author

This was already removed in a subsequent commit — neither test_golden_parity.py nor test_golden_parity_r2.py imports pytest in the current version.

from __future__ import annotations

import pyarrow as pa
import pytest

Copilot AI Mar 20, 2026

pytest is imported but not used in this test module. Consider removing it to avoid unused-import warnings in linted environments.

Suggested change
import pytest

Contributor Author

Same as above — pytest is not imported in the current version of this file.

- Fix docstring to reference spec in the Rust repo (not local path)
- Normalize both key and item fields in Map type, preserve keys_sorted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

agent-kurouto bot commented Mar 20, 2026

Response to Copilot review

All three comments addressed in commit 4bde6a1:

  1. Docstring reference — Fixed. Now references nauticalab/starfix docs/byte-layout-spec.md instead of a non-existent local path.

  2. Struct null propagation in _extract_type_entries — Intentional, no change. The spec (Section 3.5, Example M) propagates struct nulls at traversal time via _traverse_struct, not at entry-creation time. Non-nullable children of a nullable struct get data-only entries (no null_bits). This matches the Rust implementation (arrow_digester_core.rs:934) and is verified by 126 parity tests including r2_nullable_struct_batch and r2_nested_nullable_inner_struct.

  3. Map normalization — Fixed. Now normalizes both key and item fields, and preserves keys_sorted. Map types aren't exercised by tests yet, but the fix is correct per the spec.

Runs pytest across Python 3.10, 3.11, 3.12 using uv on push to main
and on pull requests. Covers all 126 golden parity + unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eywalker eywalker merged commit b2fd8ca into main Mar 20, 2026
3 checks passed
@eywalker eywalker deleted the eywalker/plt-1043-prepare-nauticalabstarfix-python-to-reach-logical-hashing branch March 20, 2026 07:50